mpg
## # A tibble: 234 x 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manual… f 21 29 p comp…
## 3 audi a4 2 2008 4 manual… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto(a… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto(l… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manual… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto(a… f 18 27 p comp…
## 8 audi a4 quat… 1.8 1999 4 manual… 4 18 26 p comp…
## 9 audi a4 quat… 1.8 1999 4 auto(l… 4 16 25 p comp…
## 10 audi a4 quat… 2 2008 4 manual… 4 20 28 p comp…
## # … with 224 more rows
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_point(mapping = aes(x = hwy, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = class, y = drv))
Graphing Template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Exercises
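What does the drv variable describe? drv describes whether the vehicle is front-wheel drive, rear-wheel drive, or 4-wheel drive.
What happens if you make a scatterplot of class vs drv? Why is the plot not useful? Both variables are categorical, so many cars land on the same point; the plot shows only which class and drv combinations occur, not how many cars fall in each.
# Adding a color aesthetic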
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
# Adding an alpha aesthetic, which adjusts transparency
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
# Adding a shape aesthetic
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
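The shape warning above suggests specifying shapes manually. A minimal sketch of that follows, assuming the shape codes 0 to 6 are acceptable (they are an arbitrary choice made here for illustration, not something the original notes prescribe):

```r
library(ggplot2)

# One shape code per class; mpg$class has 7 distinct values.
# The codes 0:6 are an illustrative assumption, any 7 valid codes would do.
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

# With seven shapes supplied, no row is left without a shape,
# so no points are dropped from the plot.
missing_shapes <- sum(is.na(layer_data(p_shapes)$shape))

print(p_shapes)
```

Whether seven shapes are still easy to tell apart is a separate question; the warning's point about readability still applies.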
# Manually setting aesthetic properties
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
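The difference between setting an aesthetic (outside aes()) and mapping it (inside aes()) can be checked by looking at the colours ggplot2 actually computes for the layer. This is a small sketch using layer_data(); the object names are made up for the example:

```r
library(ggplot2)

# color set outside aes(): every point gets the literal colour "blue".
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# color placed inside aes(): "blue" is treated as a one-level variable
# and is run through the default colour scale instead.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

set_colours <- unique(layer_data(p_set)$colour)
mapped_colours <- unique(layer_data(p_mapped)$colour)

print(set_colours)
print(mapped_colours)
```

In the mapped version there is a single colour, but it comes from the scale rather than from the string itself, which is why the points do not come out blue.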
Exercises
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
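The points are not blue because color = "blue" sits inside aes(), where it is treated as a variable to map rather than a color to set. Move it outside aes(): geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Which variables in mpg are categorical? Out of 11 variables, 6 are categorical (the <chr> columns) and 5 are numeric (the <dbl> and <int> columns).
ggplot(data = mpg) +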
geom_point(mapping=aes(x = displ, y = hwy, color = cty))
ggplot(data = mpg) +
geom_point(mapping=aes(x= displ, y = hwy, color = hwy))
What does the stroke aesthetic do? What shapes does it work with? It modifies the width of the border on shapes that have borders, such as shape 21 used below.
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(shape = 21, color = "black", fill = "white", size = 5, stroke = 5)
What happens if you map an aesthetic to an expression such as aes(color = displ < 5)? (Specify x and y.) ggplot2 evaluates the expression for each row and colors the points by whether the result is TRUE or FALSE.
ggplot(data = mpg) +
geom_point(mapping=aes(color = displ < 5, x = hwy, y = cty))
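To see why the plot above ends up with exactly two colours, it can help to evaluate the expression on its own. A short sketch (variable names are illustrative):

```r
library(ggplot2)

# The expression is evaluated per row and yields a logical vector.
is_small_engine <- mpg$displ < 5
print(table(is_small_engine))

p_logical <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cty, color = displ < 5))

# The built layer contains one colour for FALSE and one for TRUE.
n_colours <- length(unique(layer_data(p_logical)$colour))
print(n_colours)
```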
To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl) # nrow can't be set here: the rows come from the unique values of drv and the columns from the unique values of cyl
If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name, e.g. + facet_grid(. ~ cyl).
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
Exercises
ggplot(data = mpg) +
geom_point(mapping = aes(displ, hwy)) +
facet_grid(~year)
What do the empty cells in the plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot? They mean that no cars in the data have that combination of drive type and number of cylinders, which is confirmed by the lack of dots at the corresponding intersections in the plot below.
ggplot(data = mpg) +
geom_point(mapping = aes(drv, cyl))
What does . do? The following code plots highway miles per gallon against engine displacement, split into rows by drive type in the first plot and by number of cylinders in the second. The . is a placeholder meaning "do not facet in this dimension", so each plot has a single column of panels.
ggplot(data = mpg) +
geom_point(mapping=aes(displ, hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(displ, hwy)) +
facet_grid(cyl ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset? To me, the greatest advantage of faceting is that it splits up data that is very noisy or crowded in a small region. It spreads similar data out much more, but it makes specific comparisons harder since the data isn’t in the same panel.
What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments? nrow and ncol set the number of rows and columns, respectively. facet_grid() doesn’t have them because the number of unique values in the two faceting variables determines the number of rows and columns.
When using facet_grid() you should usually put the variable with more unique levels in the columns. Why? There will be more space for columns if the plot is laid out horizontally (landscape).
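The empty cells discussed in the facet_grid(drv ~ cyl) exercise above can also be found without plotting, by cross-tabulating the two variables. This is a sketch in base R; the names combos and empty are invented for the example:

```r
library(ggplot2)  # only needed for the mpg dataset

# Count the cars in each drv / cyl combination.
combos <- table(drv = mpg$drv, cyl = mpg$cyl)
print(combos)

# Combinations with a count of zero are the empty facet cells.
empty <- which(combos == 0, arr.ind = TRUE)
print(empty)
```

The dimensions of the table also show why facet_grid() needs no nrow or ncol: the grid has one row per value of drv and one column per value of cyl.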
To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use this code:
ggplot(data = mpg) +
geom_point(mapping = aes(displ, hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(displ, hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.
ggplot(data = mpg) +
geom_smooth(mapping = aes(displ, hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
To display multiple geoms in the same plot, add multiple geom functions to ggplot():
ggplot(data = mpg) +
geom_point(mapping = aes(displ, hwy)) +
geom_smooth(mapping = aes(displ, hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display cty instead of hwy. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:
ggplot(data = mpg, mapping = aes(displ, hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
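With global mappings, the change described above (showing cty instead of hwy on the y-axis) becomes a one-place edit. A minimal sketch, with layer_data() used only to confirm that the point layer picked up the new variable:

```r
library(ggplot2)

# Only the global mapping changes; both geoms inherit it.
p_cty <- ggplot(data = mpg, mapping = aes(displ, cty)) +
  geom_point() +
  geom_smooth()

# The y values drawn by the point layer are now the cty values.
points_y <- layer_data(p_cty, 1)$y

print(p_cty)
```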
If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.
ggplot(data = mpg, mapping = aes(displ, hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
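One way to confirm that the local color mapping affects only the point layer is to inspect each layer's computed data. The sketch below assumes nothing beyond ggplot2 itself; the variable names are illustrative:

```r
library(ggplot2)

p_layers <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

# Layer 1 (points): one colour per class, because color is mapped locally.
point_colours <- unique(layer_data(p_layers, 1)$colour)

# Layer 2 (smooth): no local mapping, so the data stay in a single group
# and a single line is drawn.
smooth_groups <- unique(layer_data(p_layers, 2)$group)

print(length(point_colours))
print(length(smooth_groups))
```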
You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.
ggplot(data = mpg, mapping = aes(displ, hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
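Note that filter() in the code above comes from dplyr, so that package must be loaded. The sketch below makes the dependency explicit and checks that the smooth line spans only the subcompact cars; subcompact and smooth_range are names chosen for this example:

```r
library(ggplot2)
library(dplyr)

# Subset used by the smooth layer only.
subcompact <- filter(mpg, class == "subcompact")

p_sub <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = subcompact, se = FALSE)

# The fitted line covers only the displ range of the subcompact cars,
# while the points still come from the full dataset.
smooth_range <- range(layer_data(p_sub, 2)$x)
print(smooth_range)
print(range(subcompact$displ))
```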
Exercises
- line chart: geom_line()
- boxplot: geom_boxplot()
- histogram: geom_histogram()
- area chart: geom_area()
displ and hwy
ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter? show.legend = FALSE removes the legend from the graph; if you remove it, the legend is shown again. It was used in an earlier example because the other two graphs being compared didn’t have legends, since they didn’t map a third, categorical variable.
What does the se argument to geom_smooth() do? The se argument controls whether a confidence band is drawn around the smoothed line: se = TRUE (the default) shows it and se = FALSE hides it.
Will the two graphs below look different? Why/why not? No, they will look the same. Both use the same data and mappings; the first sets them once in ggplot(), while the second repeats them in each geom.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(displ, hwy)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(displ, hwy)) +
geom_point() +
geom_smooth(mapping = aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color= drv)) +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mpg, aes(displ, hwy)) +
geom_point(size = 4, color = "white") +
geom_point(aes(color = drv))
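The answer to the "will the two graphs look different" exercise above can be checked by comparing the data each layer computes. This is a sketch rather than a formal proof; p_global and p_local are names made up here:

```r
library(ggplot2)

# Data and mappings declared once, globally.
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated in each layer.
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the plotted coordinates of the points and of the fitted line.
same_points <- isTRUE(all.equal(layer_data(p_global, 1)$y,
                                layer_data(p_local, 1)$y))
same_smooth <- isTRUE(all.equal(layer_data(p_global, 2)$y,
                                layer_data(p_local, 2)$y))
print(c(same_points, same_smooth))
```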
### 3.7 Statistical Transformation
Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.
ggplot(data = diamonds) +
geom_bar(aes(x = cut))
You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
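That geom_bar() and stat_count() compute the same thing can be verified by pulling the bar heights out of each plot and comparing them with a plain row count. A minimal sketch using layer_data():

```r
library(ggplot2)

# Bar heights computed by each version of the plot.
bar_counts <- layer_data(ggplot(data = diamonds) +
                           geom_bar(mapping = aes(x = cut)))$count
stat_counts <- layer_data(ggplot(data = diamonds) +
                            stat_count(mapping = aes(x = cut)))$count

# Counting rows per cut directly, in factor-level order.
row_counts <- as.vector(table(diamonds$cut))

print(bar_counts)
print(row_counts)
```

The counts are not a column in diamonds; they are a new variable that the stat computes before anything is drawn.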
There are three reasons you might need to use a stat explicitly:
Override the default stat by changing the stat of geom_bar() from count (the default) to identity. This lets me map the height of the bars to the raw values of a y variable. Unfortunately, when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or to the previous bar chart, where the height of the bar is generated by counting rows.
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
ggplot(demo) +
geom_bar(aes(cut, freq), stat="identity")
ggplot(diamonds) +
geom_bar(aes(cut, after_stat(prop), group = 1)) # after_stat() replaces the deprecated ..prop.. notation (ggplot2 >= 3.3.0)
ggplot(diamonds) +
stat_summary(
aes(cut, depth),
fun.min = min, # fun.min, fun.max and fun replace the deprecated fun.ymin, fun.ymax and fun.y (ggplot2 >= 3.3.0)
fun.max = max,
fun = median
)
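The identity-stat bar chart earlier in this section can also be written with geom_col(), which ggplot2 provides for bars whose heights are already in the data. The sketch below rebuilds the demo table with base R's data.frame() instead of tribble(), purely to keep the example free of extra packages:

```r
library(ggplot2)

# Same values as the demo table above, built with base R.
demo <- data.frame(
  cut = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
  freq = c(1610, 4906, 12082, 13791, 21551)
)

p_identity <- ggplot(demo) +
  geom_bar(aes(cut, freq), stat = "identity")

# geom_col() uses the identity stat by default.
p_col <- ggplot(demo) +
  geom_col(aes(cut, freq))

identity_heights <- layer_data(p_identity)$y
col_heights <- layer_data(p_col)$y
print(col_heights)
```

Because cut is a character column here, the bars are ordered alphabetically rather than by cut quality.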